Our project

We decided to do central Colombia, basically because it is where the capital is.

We built a model for the number of confirmed cases using all the others covariates (plus some we created) and we estimated the predictive accuracy of our selected model.

We decided to consider as central Colombia the following departments/districts: Bogotà DC, Boyacá, Tolima, Cundinamarca, Meta, Quindío, Valle del Cauca, Risaralda, Celdas, Boyacá, Antioquia, Santander, Casanare.

Description of variables

  • ID de caso: ID of the confirmed case.

  • Fecha de diagnóstico: Date in which the disease was diagnosed.

  • Ciudad de ubicación: City where the case was diagnosed.

  • Departamento o Distrito: Department or district where the city belongs to.

  • Atención: Situation of the patient: recovered, at home, at the hospital, at the ICU or deceased.

  • Edad: Age of the confirmed case.

  • Sexo: Sex of the confirmed case.

  • Tipo: How the person got infected: in Colombia, abroad or unknown.

  • País de procedencia: Country of origin if the person got infected abroad.

Map

Here we can see our selected cities. The color of the pins is related with the number of cases: if they are less than \(10\) the color is “green”, if they are less than \(100\) the color is “orange”, otherwise it is “red”.

Preprocessing

We had to clean the dataset:

  • We transformed the Fecha de diagnóstico variable into a Date type variable,

  • we fixed the variable Id de caso (since we removed some departments, so some lines, the numbers weren’t consecutive),

  • we created a variable Grupo de edad,

  • we cleaned the column País de procedencia (replaced cities with the country) and created the variable Continente de procedencia (as the first is too fragmented we thought to consider the continents).

##   ID de caso Fecha de diagnóstico Ciudad de ubicación Departamento o Distrito
## 1          1           2020-03-06         Bogotá D.C.             Bogotá D.C.
## 2          2           2020-03-09            Medellín               Antioquia
## 3          3           2020-03-11            Medellín               Antioquia
##     Atención Edad Sexo        Tipo País de procedencia Grupo de edad
## 1 Recuperado   19    F   Importado              Italia         19_30
## 2 Recuperado   50    F   Importado              España         46_60
## 3 Recuperado   55    M Relacionado                 Nan         46_60

New dataset I

##          Date Elapsed time New cases/day Cumulative cases
## 1  2020-03-06            0             1                1
## 2  2020-03-09            3             1                2
## 3  2020-03-11            5             5                7
## 4  2020-03-12            6             2                9
## 5  2020-03-13            7             3               12
## 6  2020-03-14            8            14               26
## 7  2020-03-15            9            13               39
## 8  2020-03-16           10             8               47
## 9  2020-03-17           11            13               60
## 10 2020-03-18           12             9               69

New dataset II

##          Date Elapsed time Department Department ID New cases/day
## 1  2020-03-09            3  Antioquia             1             1
## 2  2020-03-11            5  Antioquia             1             3
## 3  2020-03-14            8  Antioquia             1             3
## 4  2020-03-15            9  Antioquia             1             1
## 5  2020-03-19           13  Antioquia             1             3
## 6  2020-03-20           14  Antioquia             1            11
## 7  2020-03-21           15  Antioquia             1             3
## 8  2020-03-22           16  Antioquia             1             5
## 9  2020-03-23           17  Antioquia             1            22
## 10 2020-03-25           19  Antioquia             1             8
##    Cumulative cases/Department Mean age
## 1                            1 50.00000
## 2                            4 35.66667
## 3                            7 30.00000
## 4                            8 55.00000
## 5                           11 52.33333
## 6                           22 39.81818
## 7                           25 31.00000
## 8                           30 45.40000
## 9                           52 36.45455
## 10                          60 29.75000

Exploring the dataset

Scattered infos about pandemic in Colombia (https://en.wikipedia.org/wiki/COVID-19_pandemic_in_Colombia):

  • the quarantine started on the 20th of March, since our data are from 6th of March to 2nd of April, it is very likeliy that quarantine effects are not witnessed in our data.

  • on March the 26th there was a damage in the machine that prepared the samples for processing and subsequent diagnosis of COVID-19, which affected the speed at which results were being produced. This could explain the very low number of confirmed cases.


The previous plot represents the daily incidence of the desease across all the departments we are taking into account.

Other plots


The major number of cases are in the capital Bogotà.


Here the growth seems exponential (and this is consistent with the fact that we are studying the early stages of the outbreak).


The desease (number of cases) is more or less equally distributed across genders.


People from 31 to 45 years old are the most affected by the disease and people over 76 years old are the least affected. Colombia is a very young country. In 2018 the median age of the population was 30.4 years old and the largest age group is made of people from 25 to 54 years old, which comprises 41.98% of the population. (https://www.indexmundi.com/colombia/demographics_profile.html)

Age-Sex plot


There isn’t much difference between the sexes among the different group of ages.

Tipo

I think that en estudio means that it is not clear while the case is imported or not, however it seems like there are more imported cases, we can count them:

## # A tibble: 3 x 3
##   Tipo        `Total number` Percentage
##   <chr>                <int> <chr>     
## 1 Relacionado           8570 12%       
## 2 Importado              687 1%        
## 3 En Estudio           60406 87%

The majority of the cases in the country are people that got infected inside Colombia. Then, people that contracted the disease abroad came mainly from Europe, followed by North America and Central America.

The frequentist approach

Train/test split

We splitted the data so to leave out the last three points for prediction, because we have few points and because in this models it has no sense to predict a week, because the situation changes really fast.

Poisson

Poisson with Elapsed time plus Sexo


## [1] "RMSE: 3892.80826581986"
## [1] "AIC: 20887.674847589"
## [1] "Null deviance:  1740689.09"  "Residual deviance: 19687.03"

Predictive accuracy of the Poisson model for Cumulative cases

Predicting with a \(95\%\) confidence interval

## [1] "Frequency of coverage: 0.0416666666666667"

Poisson with Elapsed time, Age and Departments as predictors


## [1] "Estimated overdispersion 542039.960916532"
## [1] "RMSE: 8243.70282563776"
## [1] "AIC: 18085.2709276108"
## [1] "Null deviance:  1740689.09"  "Residual deviance: 16854.63"

Predictive accuracy of the Poisson model for Cumulative cases

Predicting with a \(95\%\) confidence interval

## [1] "Frequency of coverage: 0.0583333333333333"

Poisson with Elapsed time, Age and Departments as predictors for New cases/day


## [1] "Estimated overdispersion 6785702.57486846"
## [1] "RMSE: 1225.26041309972"
## [1] "AIC: 2860.38819023372"
## [1] "Null deviance:  64369.08"   "Residual deviance: 1979.14"

Predictive accuracy of the Poisson model for New cases/day

Predicting with a \(95\%\) confidence interval

## [1] "Frequency of coverage: 0.15"

Poisson with Elapsed time, Elapsed time^2, Age and Departments as predictors


## [1] "Estimated overdispersion 1226447.182199"
## [1] "RMSE: 8441.28307878747"
## [1] "AIC: 8781.84768119021"
## [1] "Null deviance:  1740689.09" "Residual deviance: 7549.2"

Predictive accuracy of the Poisson model for Cumulative cases

## [1] "Frequency of coverage: 0.0333333333333333"

Autocorrelation to compare Poisson models

We generated 1000 samples from each of the four Poisson models and calculated the autocorrelation and compared against the autocorrelation of our original sample.


ANOVA to compare the Poisson models

## Analysis of Deviance Table
## 
## Model 1: `Cumulative cases` ~ `Elapsed time`
## Model 2: `Cumulative cases` ~ `Elapsed time` + Sexo_M
## Model 3: `Cumulative cases` ~ `Elapsed time` + `Grupo de edad_19_30` + 
##     `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` + 
##     `Grupo de edad_76+`
## Model 4: `Cumulative cases` ~ `Elapsed time` + `Grupo de edad_19_30` + 
##     `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` + 
##     `Grupo de edad_76+` + `Departamento o Distrito_Bogotá D.C.` + 
##     `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` + 
##     `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` + 
##     `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` + 
##     `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` + 
##     `Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima`
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       118      20716                          
## 2       117      19687  1  1029.35 < 2.2e-16 ***
## 3       113      19441  4   245.65 < 2.2e-16 ***
## 4       102      16855 11  2586.75 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Quasi-Poisson

Negative Binomial

Negative Binomial with Elapsed time as predictor


## [1] "Estimated overdispersion 177.301055416919"
## [1] "RMSE: 1765.66109000166"
## [1] "AIC: 21911.1770461798"
## [1] "Null deviance:  1738722.15"  "Residual deviance: 20710.41"

Predictive accuracy of the Negative Binomial model

Predicting with a \(95\%\) confidence interval

## [1] "Frequency of coverage: 0.0416666666666667"

Negative Binomial with Elapsed time plus Age as predictors


## [1] "RMSE: 3253.44941446475"
## [1] "AIC: 20645.2695885301"
## [1] "Null deviance:  1738975.56"  "Residual deviance: 19434.52"

Predictive accuracy of the Negative Binomial model

Predicting with a \(95\%\) confidence interval

## [1] "Frequency of coverage: 0.0333333333333333"

Autocorrelation to compare Negative Binomial models

We generated 1000 samples from each of the four Negative Binomial models and calculated the autocorrelation and compared against the autocorrelation of our original sample.


Applying ANOVA to compare the negative binomial models

## Likelihood ratio tests of Negative Binomial Models
## 
## Response: Cumulative cases
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Model
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 Elapsed time
## 2                                                                                                                                                                                                                                                                                                                                                                                                                         `Elapsed time` + `Grupo de edad_19_30` + `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` + `Grupo de edad_76+`
## 3                                                                                                                             `Elapsed time` + `Departamento o Distrito_Bogotá D.C.` + `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` + `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` + `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` + `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` + `Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima`
## 4 `Elapsed time` + `Grupo de edad_19_30` + `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` + `Grupo de edad_76+` + `Departamento o Distrito_Bogotá D.C.` + `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` + `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` + `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` + `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` + `Departamento o Distrito_Santander` + \n    `Departamento o Distrito_Tolima`
##      theta Resid. df    2 x log-lik.   Test    df LR stat. Pr(Chi)
## 1 11252780       118       -21905.18                              
## 2 12919543       113       -20629.27 1 vs 2     5 1275.907       0
## 3 13821345       107       -19242.25 2 vs 3     6 1387.022       0
## 4 13728063       102       -18042.49 3 vs 4     5 1199.758       0

The Bayesian approach

Poisson regression

As a first attempt, we fit a simple Poisson regression:

\[ ln(\lambda_i) = \alpha + \beta\cdot elapsed\_time_i \\ y_i \sim \mathcal{Poisson}(\lambda_i) \\ \alpha \sim \mathcal{N}(0,1) \\ \beta \sim \mathcal{N}(0.25,1) \]

with \(i = 1,\dots,134\), being \(134\) the number of rows of our dataset, and \(y_i\) represents the number of cases.

For what concerns the stan program, we used the function poisson_log_rng to describe the distribution of \(y_i\), namely the number of cases each day and the function poisson_log_lpmf to specify the likelihood.

Posterior predictive check

poisson_posterior-1_new

The fit is not satisfactory, it is probably due to overdispersion, we can check the residuals to confirm this hypothesis.

Residual check

first_residual-1_new

The variance of the residuals increases as the predicted value increase. The standardized residuals should have mean 0 and standard deviation 1 (hence the lines at \(+2\) and \(-2\) indicates approximate \(95\%\) error bounds).

The plot of the standardized residuals indicates a large amount of overdispersion.

Classically the problem of having overdispersed data is solved using the negative binomial model instead of the Poisson’s one.

Negative Binomial model

We try to improve the previous model using the Negative Binomial model:

\[ ln(\lambda_i) = \alpha + \beta\cdot elapsed\_time_i \\ y_i \sim \mathcal{Negative Binomial}(\lambda_i, \phi) \\ \alpha \sim \mathcal{N}(0,1) \\ \beta \sim \mathcal{N}(0.25,1) \]

Where the parameter \(\phi\) is called precision and it is such that:

\[ E[y_i] = \lambda_i \\ Var[y_i] = \lambda_i + \frac{\lambda_i^2}{\phi} \]

again \(i=1,\dots,134\). As \(\phi \rightarrow \infty\) the negative binomial approaches the Poisson distribution.

The stan function that we use here are neg_binomial_2_log_rng to specify the distribution of \(y_i\) and the function neg_binomial_2_log_lpmf for the likelihood.

Accuracy across departments

NB_deps_new

We should take into account the differences across departments.

Multilevel Negative Binomial regression

We try to fit the following model, which also includes Age as covariat:

\[ ln(\lambda_i) = \alpha + \beta_{time}\cdot elapsed\_time_i + \beta_{age}\cdot age \\ y_i \sim \mathcal{Negative Binomial}(\lambda_i, \phi) \\ \alpha \sim \mathcal{N}(0,1) \\ \beta_{time} \sim \mathcal{N}(0.5,1) \\ \beta_{age} \sim \mathcal{N}(0,1) \]

Accuracy across departments

NB2_dep_new

Hierarchical model

In order to improve the fit, we fit a model with department-specific intercept term.

So the varying intercept model that we take into account is now:

\[ ln(\lambda_{i,d}) = \alpha_d + + \beta_{time}\cdot elapsed\_time_i + \beta_{age}\cdot age_i\\ \alpha_d \sim \mathcal{N}(\mu + \beta_{pop}\cdot pop_d + \beta_{sur}\cdot surface_d + \beta_{dens} \cdot density_d, \sigma_{\alpha})\\ y_i \sim \mathcal{Negative Binomial}(\lambda_{i,d}, \phi) \]

The priors used for the above model are the following:

\[ \beta_{time} \sim \mathcal{N}(0.5,1) \\ \beta_{age} \sim \mathcal{N}(0,1) \\ \psi \sim \mathcal{N}(0,1) \]

being \(\psi = [\beta_{pop}, \beta_{sur}, \beta_{dens}]\).

New dataset

We added the following covariats into the dataset:

  • People: millions of inhabitants for each region;

  • Surface: \(km^3\), extent of each region;

  • Density: \(\frac{people}{km^2}\), density of the population in each region.

The model is:

Accuracy across departments

hier_deps_new

We can clearly see that the accuracy across the departments has significantly increased with respect to the previous models.

LOOIC

The Leave-One-Out cross validation is a method for estimating pointwise out-of-sample prediction accuracy from a fitted Bayesian model using the log-likelihood evaluated at the posterior simulation of the parameters values.

Plot the looic to compare models:

looic_new